Unsupervised SMT for a low-resourced language pair

نویسندگان

Thi-Ngoc-Diep Do

Laurent Besacier

Eric Castelli

چکیده

This paper presents an unsupervised method in application of extracting parallel sentence pairs from a comparable corpus. A translation system is used to mine the comparable corpus and to withdraw the parallel sentence pairs. An iteration process is implemented not only to increase the number of extracted parallel sentence pairs but also to improve the quality of translation system. A comparison between this unsupervised method and a semi-supervised method is also presented. The unsupervised extracting method was tested in a hard condition: the parallel corpus did not exist and the comparable corpus contained up to 50% of non parallel sentence pairs. However, the result shows that the unsupervised method can be really applied in the case of lacking parallel data.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

This study investigates the possibility of using statistical machine translation to create domainspecific language resources. We propose a methodology that aims to create a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-su...

متن کامل

Employing Pivot Language Technique through Statistical and Neural Machine Translation Frameworks: the Case of Under-resourced Persian-spanish Language Pair

The quality of Neural Machine Translation (NMT) systems like Statistical Machine Translation (SMT) systems, heavily depends on the size of training data set, while for some pairs of languages, high-quality parallel data are poor resources. In order to respond to this low-resourced training data bottleneck reality, we employ the pivoting approach in both neural MT and statistical MT frameworks. ...

متن کامل

TmTriangulate: A Tool for Phrase Table Triangulation

This work was supported by the grants no 645452 (QT21) and no 644402 (HimL) of the EU and SVV 260 104 of the Czech Republic. We used language resources hosted by the LINDAT/CLARIN project LM2010013 of the Ministry of Education, Youth and Sports. Introduction Under-resourced language pair: Scarcity of parallel corpora SMT Problem: No direct data → no SMT training Insufficient data → poor SMT per...

متن کامل

Extraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée. (Extraction a parallel corpus for machine translation from and to under-resourced languages)

Nowadays, machine translation has reached good results when applied to several language pairs such as English – French, English – Chinese, English – Spanish, etc. Empirical translation, particularly statistical machine translation allows us to build quickly a translation system if adequate data is available because statistical machine translation is based on models trained from large parallel b...

متن کامل

A Hybrid Machine Translation System Based on a Monotone Decoder

In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2010

Unsupervised SMT for a low-resourced language pair

نویسندگان

چکیده

منابع مشابه

SMT-based ASR domain adaptation methods for under-resourced languages: Application to Romanian

Employing Pivot Language Technique through Statistical and Neural Machine Translation Frameworks: the Case of Under-resourced Persian-spanish Language Pair

TmTriangulate: A Tool for Phrase Table Triangulation

Extraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée. (Extraction a parallel corpus for machine translation from and to under-resourced languages)

A Hybrid Machine Translation System Based on a Monotone Decoder

عنوان ژورنال:

اشتراک گذاری